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We introduce and evaluate a set of feature vector representations of crystal structures for machine 
learning (ML) models of formation energies of solids. ML models of atomization energies of organic 
molecules have been successful using a Coulomb matrix representation of the molecule. We consider 
three ways to generalize such representations to periodic systems: (i) Si matrix where each element is 
related to the Ewald sum of the electrostatic interaction between two different atoms in the unit cell 
repeated over the lattice; (ii) an extended Coulomb-like matrix that takes into account a number 
of neighboring unit cells; and (Hi) an ansatz that mimics the periodicity and the basic features of 
the elements in the Ewald sum matrix by using a sine function of the crystal coordinates of the 
atoms. The representations are compared for a Laplacian kernel with Manhattan norm, trained to 
reproduce formation energies using a data set of 3938 crystal structures obtained from the Materials 
Project. Eor training sets consisting of 3000 crystals, the generalization error in predicting formation 
energies of new structures corresponds to (i) 0.49, (ii) 0.64, and (iii) 0.37 eV/atom for the respective 
representations. 


I. INTRODUCTION 

First-principles simulations for the prediction of prop¬ 
erties of chemical and materials systems have become a 
standard tool throughout theoretical chemistry, physics 
and biology. These simulations are based on repeat¬ 
edly solving numerical approximations to the underly¬ 
ing many-body problem, z.e., the Schrodinger equation. 
However, in the pursuit of ever increasing efficiency, there 
is a growing interest in side-stepping the physics-based 
formulation of the problem, and instead exploit big data 
methodology, e.g., artificial intelligence, evolutionary and 
machine learning (ML) schemes. If successful, such ap¬ 
proaches can lead to an orders-of-magnitude increase 
in computational efficiency for describing properties of 
atomistic systems. Such a change would not merely 
bring incremental progress, but certainly represents a 
paradigm-shift in terms of enabling the study of systems 
and problems which hitherto have been completely out 
of reach. 

Recently, one of us has presented an ML scheme 
that predicts atomization energies of new (out-of-sample) 
molecules with an accuracy even beyond that of the most 
common first-principles method, density functional the- 
ory [IHl]. Subsequently, it was shown that this approach 
also works for other properties such as molecular polar¬ 
izability and frontier orbital eigenvalues [5] , transmission 
coefficients of electron transport in doped nano-ribbon 
models [6], and even densities of state within the An- 
dersion impurity model [7]. A critical component of any 
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such ML scheme is the feature vector representation of 
the data set, also often called the descriptor 019]; i.e., 
the mapping of the atomistic description of the system 
into a form suitable for matrix operations. In the cited 
work [1] , an atom by atom matrix, dubbed Coulomb ma¬ 
trix, was used with diagonal and off-diagonal elements 
representing the potential of the free atom and the inter¬ 
atomic Coulomb repulsion between nuclear charges. 

The aim of the present study is to consider various 
ways of adapting this representation to periodic systems, 
and to compare their performance in ML models of for¬ 
mation energies of solids. We have considered three gen¬ 
eralizations of the Coulomb matrix idea; (i) a matrix 
where each element is related to the Ewald sum of the 
electrostatic interaction between two different atoms in 
the unit cell repeated over the lattice; (ii) an extended 
Coulomb-like matrix that takes into account a number of 
neighboring unit cells, and (iii) a simplified matrix ansatz 
chosen to mimic the periodicity and basic features of the 
elements in the Ewald sum matrix using a sine function 
of atomic coordinates. 

A few studies have already considered representations 
of periodic systems suitable for ML. Schutt et al. trained 
an ML model using radial distribution functions to pre¬ 
dict the density of states at the Fermi level of solids [4]. 
Meredig et al. used a heuristics based ML model to 
screen 1.6 M ternary solids for stability m- Ram- 
prasad and co-workers, relying on chemo-structural fin¬ 
gerprints in ML models of formation energies (and other 
properties) of polymers, obtained scatter plots [TT] for 
which visual inspection suggests an error on the order 
of magnitude of ^0.5 eV. Bartok et al. used ML to en¬ 
hance the computer simulation of molecular materials 
[12]. However, to the best of our knowledge no clear 
example has been presented so far for directly applying 
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ML to reproduce cell-, cohesive-, or formation energies 
based on an atom by atom matrix based representation 
of the crystal structure. We present such an applica¬ 
tion based on the ML methods which have shown good 
performance for molecules. We find that all represen¬ 
tations investigated in this work yield similar accuracy 
(0.4 —0.6 eV/atom at 3000 crystals), which is less impres¬ 
sive than the accuracy found for atomization energies of 
molecules ( 0.02 eV/atom for training sets of similar size). 
However, our results indicate that if the data set used for 
training can be further expanded, a machine capable of 
estimating formation energies at near first-principles ac¬ 
curacy is still within reach. We briefly discuss the reason 
for the disparity in performance between finite and peri¬ 
odic systems, and how this may be resolved. 

This paper is organized as follows. In Sec. II we briefly 
present the kernel ridge regression (KRR) scheme. In 
Sec. Ill we introduce our various representations of crys¬ 
tal structures. In Sec. IV some technical details are given 
regarding the implementation. In Sec. V we analyze the 
representations and test their performance in machines 
trained on a data set of formation energies of solids. In 
Sec. VI we discuss our results, and finally, in Sec. VII 
this study is summarized and concluded. 

II. KERNEL RIDGE REGRESSION 

We present a short summary of the ridge regression 
method [13] and KRR [HI [15] , defining our notation. In 
ridge regression one seeks to approximate an unknown 
function y{x) where x is a feature-vector representation of 
some input (e.g., a molecule or crystal system). We start 
from a set of n data points y = (^i, ^ 2 , ^n)^ known 

for a set of feature vectors x = (xi, X 2 ,..., x^)^, where we 
use the linear-algebra convention of distinguishing row 
and column vectors, and is the transpose of a. 

An approximation f{x, a.) is constructed as 

n 

f{x, a) = k(a:;)a = kj{x)aj ( 1 ) 

i=i 

a = {ai,a2,:.,an) ( 2 ) 

k(x) = (fci(x), k2{x),kn{x))'^, (3) 

where kj{x) is a function ansatz of x which is the same 
for all j, and to be chosen freely, usually assumed to 
be polynomial. The approximation /(x, a) is optimized 
by seeking the ol that minimizes the Euclidean norm 
||y — k(x)a|| 2 . The solution is given by the normal equa¬ 
tion 

a = (k(x)^k(x))-ik(x)^y. (4) 

If an ansatz for k(x) with a high degree of freedom is 
used, there is considerable risk of overfitting, z.e., one ar¬ 
rives at a model that fits the training data set well, but 
still performs poorly when predicting y for out-of-sample 
cases (feature vectors x that are not included in the train¬ 
ing set X.) Several techniques are used to address this 


problem, e.^., cross-validation and regularization. In the 
case of the latter, Eq. 0 is modified by inserting an extra 
term, 

a = (k(x)'^k(x) + A/)“^k(x)'^y, (5) 

where A is the regularization parameter and / the identity 
matrix. 

Ridge regression is extended into KRR by introducing 
a map, from a non-linear space to a linear space 
that is known as the feature space [TGHTS] . The map¬ 
ping is usually non-trivial, but does not need to be 
known explicitly since it is sufficient to know the inner 
product between the mapped data, (4>^(xi), 4>^(xj)) = 
{xj)i = k{xi,Xj) where k{xi^Xj) is the 
kernel. This procedure is also known as the kernel triek 
US]. The function f{x, a.) now takes the form 

n 

f{x,a) ='^aik(x,Xi) (6) 

i 

and the approximation is optimized by seeking the min¬ 
imum of 

n n 

^iVi - f{xi,a)f + '^aik{xj,Xi)aj, ( 7 ) 

* hi 

with solution 

a = (K + A/)-V, (8) 

where K = k{xi^Xj) is the kernel matrix. Different 
choices of kernel functions are appropriate for different 
ML applications. Two of the more commonly employed 
kernel functions are, 

Gaussian: k{xi,Xj) = ) ( 9 ) 

Laplacian: k{xi,Xj) = (10) 

defined as using the Euclidean (L^) and Manhattan (L^) 
norm, respectively. Here a represents the kernel width, a 
measure of locality in the employed training set. Without 
any loss of generality we restrict ourselves to the Lapla¬ 
cian kernel for this study. This choice is motivated by the 
fact that it has shown good performance in the context of 
predicting molecular atomization energies [2]. However, 
the Gaussian, or other kernels, could have been used just 
as well. 

As a relevant benchmark of performance of the ML 
model with a given representation we use the error in 
predicting new data outside x, the generalization error 
(GE). There are several ways to estimate the GE. We 
exclude some subset of all available data x (and y) from 
the solution of Eq. 0 and then evaluate the mean abso¬ 
lute error (MAE) for the prediction of the excluded data. 
The subset of x used in Eq. ^ is called the training set 
Xtrain, and the remaining input the test set Xtest- Hence, 
we define 

MAEtest = - V \yi-f{xi,cx)\, (11) 
m . 

^CXtest 
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where m is the size of the test set. The measure MAEtest 
determines how well the machine predicts new data. The 
number of systems included in Xtest needs to be suffi¬ 
ciently large for this measure to be an accurate estimate 
of the GE. Hence, there is a trade-off between using avail¬ 
able data as part of Xtest oi* xtrain- We alleviate this prob¬ 
lem by a statistical validation method known as /c-fold 
cross-validation following the recipes in Ref. [ 2 j combined 
with random sampling. The MAEtest is calculated as the 
mean of the MAEtest of several independent ML runs. 
In each run Xtest is constructed by selecting a specific 
number of entries randomly out of the full data set. 

Similarly, MAEtrain is defined by substituting Xtrain for 
Xtest in Eq. 0- This measure determines the precision 
with which the machine reproduces the data it has been 
trained on. A high MAEtrain suggests the machine has 
too few degrees of freedom to describe the data well. A 
low MAEtrain suggcsts high flexibility of the fitting func¬ 
tion, a potential risk of overfitting, and that it may be 
possible to reach good transferability with a sufficiently 
large data set. 

III. REPRESENTATIONS 

The feature vector representation used for the input is 
of central importance for the ML scheme. Here, this is 
the mapping of the positions and identities of all atoms 
into a vector x. We expect a well-suited representation to 
exhibit the following beneficial properties (with examples 
in parenthesis given for molecules) 

1 . Complete, non-degenerate: x should incorporate all 
features in the input that are relevant for the un¬ 
derlying problem. (Two different molecules should 
give different x.) 

2 . Compaet, unique: x should have minimal features 
that are redundant to the underlying problem. 
(Two instances of the same molecule, but rotated 
differently, should give the same x.) 

3. Deseriptive: instances of input that are ‘close’, giv¬ 
ing similar y, should generally be represented by x 
that are close, in the sense of a small \\xi — X 2 \\. 
(Two molecules that are identical except for small 
differences in the atomic positions, which have sim¬ 
ilar y, should generally have similar x.) 

4. Simple: generating the representation for a given 
input should require as little computational effort 
as possible. (The set of eigenvalues from solu¬ 
tion of the many-body Schrodinger equation of the 
molecule would generally not be a useful x.) 

A bijective representation is perfectly non-degenerate 
and unique, ie., it has a one-to-one correspondence be¬ 
tween the input and x. In this discussion, the distinc¬ 
tion made by ‘relevant for the underlying problem’ is im¬ 
portant, because different features of the input may be 


relevant for different applications. Eor example, in ML 
models of atomization energies the chirality of a molecule 
is not relevant, but in ML models of the optical activity 
in circular dichroism it is. 

A common method for analysis of the properties of 
different feature vector representations is principal com¬ 
ponent analysis (PCA) [20]. This method reduces the di¬ 
mensionality of a data set while still conserving as much 
information as possible. To generate the PC A, a singular 
value decomposition [ 21 ] is performed on x, 

X = UEW^. (12) 

Submatrices are created from the first k columns ex¬ 
tracted from the resulting matrices, and The 

PC A data set of reduced rank k is then defined as 

z = Ufc5]^. (13) 

This data set is not only useful for visualization. If one 
suspects that a feature vector representation is not suf¬ 
ficiently compact (in the sense of point 2 in the list of 
beneficial properties given above in this section), one can 
try to replace x with the PC A feature vector represen¬ 
tation, z, to extract a representative lower-dimensional 
part. Note that the PC A representation should only be 
constructed on the training set. The test set may not be 
used. 


A. Molecular Coulomb matrix 

The present work aims at extending the ML model for 
molecular properties minis] to properties of crystals. 
We therefore briefly summarize the molecular approach 
in the following. The feature vector representation called 
the Coulomb matrix is a symmetric atom by atom matrix 
given in Hartree atomic units, 

0.5Z2-4 

ZiZj<j>(\\vi-Vj\\ 2 ) iii^j 
(j){r) = 1 /r (15) 

where (j){r) is the Coulomb potential, and are 
the atomic number and position of the i^^ atom. The 
non-diagonal elements correspond thus to the pair-wise 
Coulomb repulsion between the positive atomic cores in 
the system, and the diagonal elements are chosen by con¬ 
struction as the result of an exponential fit to the poten¬ 
tial energy of a free atom. 

While this representation is not bijective, it is non¬ 
degenerate in the sense that no two molecules that differ 
more from each other than being enatiomers will yield the 
same Coulomb matrix. One molecule, however, can re¬ 
sult in several Coulomb matrices due to the various ways 
of ordering atoms. Atom index invariance can be intro¬ 
duced by ordering atom indices according to the norm of 
each row (or column) [HIS] , or by using permuted sets of 
Coulomb matrices [5]. Sorting, however, introduces the 
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issue of differentiability—a potentially desirable property 
when it comes to the modeling of atomic forces. Use of 
the (differentiable) eigenvalue spectrum of the Coulomb 
matrix yields an atom index invariant descriptor at the 
expense of losing uniqueness [22j [23] . There is another 
atom index invariant and differentiable representation, 
the Fourier series of atomic radial distribution functions, 
that also preserves uniqueness as well as spatial invari¬ 
ances [8|. However, this representation has not yet been 
explored in depth and has therefore not been included in 
the representations discussed herewithin. 

To benchmark our results for solids, we compare to the 
performance for the molecular ML model that has been 
trained and tested on (subsets of) the GDB-database in 
previous studies p1 [MH26] . We use a subset of 7165 en¬ 
tries out of the GDB-13 data set. These entries are all 
the organic molecules in this set with up to 7 atoms of 
elements G, O, N, and S, and valencies satisfied by hydro¬ 
gen atoms. The atomic coordinates were relaxed using 
atomic force fields, and the atomization energies calcu¬ 
lated with a higher-order first-principles method, density 
functional theory [27l [28] using the PBEO hybrid func¬ 
tional [29H3T] . We call this dataset QM7 [24f[26l [32] . 

When considering ways of extending the Coulomb ma¬ 
trix to periodic systems one might be tempted to take 
the Coulomb matrix for just the atoms in the primitive 
unit cell of the periodic crystal, alongside with the unit 
cell vectors. However, depending on the choice of prim¬ 
itive unit cell the set of interatomic distances will vary, 
since distances to neighboring atoms from different unit 
cells are not accounted for. Hence, such a representation 
is not unique in the sense that the same crystal can lead 
to different representations, and therefore is less likely to 
yield well performing ML models [4]. 


B. Ewald Sum Matrix 

We now consider a straightforward extension of the 
Coulomb matrix representation that removes the most 
obvious dependence on the non-unique set of interatomic 
distances in the primitive unit cell. We form an atom by 
atom matrix with one element for each pair of atoms in 
the primitive unit cell, but now each element is defined 
to represent the full Coulomb interaction energy corre¬ 
sponding to all infinite repetitions of these two atoms 
in the lattice. In this way, the elements in the matrix 
retain essentially the same meaning as in the Coulomb 
matrix, while the complete infinite repetition of the lat¬ 
tice is taken into account. As such, one can propose the 
following expression for the matrix elements, 

Xij = ^ZiZj y] (^(||rfc-rilla) ( 16 ) 

where the sum over k is taken over the atom i in the unit 
cell and its N closest equivalent atoms, and similarly for 
I and j. The intention is to take A" ^ oc to represent 


the full electrostatic interaction between the infinitely re¬ 
peated atoms equivalent to atoms i and j in the primi¬ 
tive cell. However, this type of infinite electrostatic sum 
has well known issues with convergence that have been 
discussed at length in the field of materials science. One 
resolution is given by the Ewald sum [33U35] . The central 


idea is to divide the problematic double sum in Eq. (16) 
into two rapidly converging sums and one constant. 


_ I I ^0 


Xij = 


(17) 


where x)-^ is the short range interaction calculated in 
real space, x[j^^ the long range interaction calculated in 
the reciprocal space, and x^j is a constant. The division 
is controlled by a screening length parameter a, which 
influences how rapidly the sums converge. The first term 
is given by 


4? = ZiZ,Y^ 


erfc(a||rt - rj + LII 2 ) 
llrj - r,' +LII 2 


(* j)t (18) 


where the sum is taken over all lattice vectors L inside a 
sphere of a radius set by a cutoff Lmax- The second term 
is 


(m) ZiZj 

G 


IIGIIi ™(G^(r.-r,)) 


(* 7^ i), 


(19) 

taken over all non-zero reciprocal lattice vectors G in a 
sphere of radius set by a cutoff Gmax, taken to be large 
enough for the sum to converge; and V is the unit cell 
volume. The last term is 


4 = -(Zi‘ + Z|)^ 




(* 7^ j), (20) 


where the first term in Eq. (20) is the Ewald self-terms 


for the i and j sites and the second term is a correction 
needed as we use a charged cell, since, in analog to the 
Coulomb matrix representation, the expressions describe 
the interaction from the positive atomic cores. The cor¬ 
rection makes the total energy be that of a system with 
a uniform compensating background that makes the sys¬ 
tem neutral. Eor the diagonal terms in the matrix {i=j) 
we take the Ewald sum interation energy of the lattice 
of Atype atoms, which is given by the same equations 
Eq. (18) and (19) but with an extra factor 1/2 and. 


n ^9 CL ^9 TT 

X% = -Zf— - Zl-^. 

2V 


( 21 ) 


The value of a only affects the rate of convergence in the 
above sums, not the final value of Xij. There are several 
suggested schemes for how to set it, in our work we take 


a = 


/O.OIM 


1/6 


( 22 ) 


where M is the number of atoms in the unit cell. 
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We refer to Eqs. (18) —(20) as the Ewald sum ma¬ 
trix representation. Our definitions are chosen to make 
each element of the matrix be the full Ewald sum of the 
Coulomb interaction between the sites i and j (or i with 
itself). We note briefly that the python materials ge¬ 
nomics (pytmatgen) open source library [36] contains a 
similar matrix as an intermediate step in the calculation 
of the full Ewald sum energy. However, that matrix differ 
from ours in that it is defined to make the sum over all 
elements give the total energy in a way that makes in¬ 
dividual matrix elements depend on the specific value of 
the a parameter used. This seems an undesirable feature 
for using the matrix as a descriptor. (Nevertheless, for 
the value a in our calculations, we use the same formula 
as in pymatgen, Eq. (22).) 


C. Extended Coulomb-like Matrix 



Another way to generalize the Coulomb matrix to pe¬ 
riodic systems is to extend the size of the representation 
matrix. Let each element be the electrostatic interaction 
between one of the M atoms in the unit cell and one 
of the atoms in the N closest unit cells, giving an M by 
N • M matrix representation on the regular Coulomb ma¬ 
trix form of Eq. (14), with 1 < i < M, I < j < N • M. 
To completely avoid the dependence on the chosen prim¬ 
itive unit cell, one would like to take ^ oo, i.e., 
an infinitely large representation matrix. This can be 
avoided if the long-range electrostatic interaction is re¬ 
placed with a more rapidly decaying interaction. Here, 
we chose (^(r) = e~^. In this way the elements of the ma¬ 
trix quickly drop to zero, and the representation matrix 
can be cut off at a finite dimension that is taken to be 
sufficiently large for all systems in the data set. We refer 
to this as the extended Coulomb matrix representation. 
One benefit of this representation over the Ewald sum 
matrix is that it is more straightforward to evaluate (one 
can simply iterate over N copies of the unit cell). 


D. Sine Matrix 


FIG. 1. An illustration of 4>(ri,r2) used in the sine matrix 
representation, Eq. (24), for a two-dimensional crystal lattice 
in a primitive unit cell shown by arrows. The figure shows the 
magnitude of our constructed ‘interaction’ between one atom 
at ri = {x^y) and another fixed at the origin r 2 = 0 (the 
latter shown along with its infinite repetitions as solid purple 
dots.) The interaction is periodic across the unit cells and 
grows to infinity as ri approach any repetition of the atom at 
the origin. 


finitely repeated grid of A and B atoms can be thought 
of as a potential field which is a function of the position of 
the atom A. Three important properties of this field are: 
(i) the expression as function of each atomic coordinate 
is periodic with respect to the crystal lattice; (ii) the con¬ 
tribution from two equivalent atoms in neighboring cells 
should be the same; (Hi) the potential should approach 
infinity when A takes the same position as B. The con¬ 
clusion is that the potential needs to be symmetric with 
respect to the lattice vectors. 

A possible choice for 4 >(ri,r 2 ) that fulfills these re¬ 
quirements is 


Yet another representation can be constructed by fur¬ 
ther extending the idea of reducing the computational 
effort by replacing the long-range electrostatic interac¬ 
tion by a simpler expression. We start from the M by M 
Ewald sum matrix in Eq. ( [T^ , but substitute the whole 
sums of the electrostatic interaction with an arbitrarily 
chosen two-point potential that is intended to share the 
same basic properties as such a sum, giving 

^fo.5zy ifz = j 

where r/ is the position of the atom in the unit cell. 

Consider two non-equivalent atoms in the unit cell, A 
and B. The Coulomb sum contribution due to the in¬ 


^(i’i,r 2 ) = ||B • y; 4 sin^ [7r4B ^ • (rj -r 2)]||2 


(24) 


k={x,y,z} 


where ex^ey^ez are the coordinate unit vectors and B 
is the matrix formed by the basis vectors of the lattice. 
The product inside the sine function thus gives the vector 
between the two sites expressed in crystal lattice coordi¬ 
nates, which gives the right periodicity in ri and r 2 . We 
call Eqs. (23) —(24) the sine matrix representation. The 


benefit of this representation over the others suggested in 
this work is that Eq. (23) is a completely straightforward 


M by M matrix that only depends on the positions of 
the atoms in a single unit cell. Hence, the computational 
load of this representation is minimal. Eigure[fJ shows 4> 
for a two-dimensional lattice. 

Note that we do not have a proof for the completeness 
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FIG. 2. The frequency of occurrence of various elements 
in the 3938 systems in the MP data set. The distribution 
is not uniform, but rather expected to reflect the occurrence 
of elements in published materials. The four most common 
elements are O, Si, Cu, and S. 



TABLE I. Optimal values of parameters A and a for the 
machines using the different feature vector representations in 
this work, and for a training set consisting of 3000 crystals 
drawn at random from the Materials Project. 


Representation 

a 

A 

Ewald sum matrix 

TcTTo® 


Generalized Goulomb matrix 

O 

T-1 

o 

00 

10 “® 

Sine matrix 

o 

T-1 

o 

10 “"^ 

Goulomb matrix for QM7 molecules 

2.5 • 10® 

10 “® 


ods for ML models with a larger set of parameters since 
the time it takes to find the optimum is of 0{x^) where p 
is the number of hyper parameters.) The optimal values 
found are shown in Table [T| The MAEtest for the different 
representations is not very sensitive to these parameters, 
Le., the regions around the minimums in the generated 
grids were relatively flat. 


V. RESULTS 


or uniqueness of these representations. They are merely 
constructed as sensible extensions of the Coulomb matrix 
idea. 


IV. IMPLEMENTATION DETAILS AND DATA 
SETS 

We have implemented the KRR ML scheme using a 
Laplacian kernel both for the original set of molecules 
from Ref.[T]with the Coulomb matrix representation, and 
for the discussed representations using a data set of cystal 
structures. We use the Python programming language, 
including numpy [37], scipy [38] and pytmatgen [36] . 

Our data set of formation energies of periodic solids 
was obtained from the Materials Project (MP) database 
[39] . We extracted 3938 systems from the MP without 
obvious order. Since the MP database is derived from 
the ICSD [40||4T], the distribution of elements in the ex¬ 
tracted systems roughly match their occurrence in pub¬ 
lished materials in the literature. This distribution is 
shown in Fig.[^ While this may be interpreted as a bias 
in the data set, one can also see it as representative for 
the intended application of ML in predicting properties 
of materials out of available material databases based on 
published materials. 

The machine requires values for the regularization pa¬ 
rameter A and kernel width a. We follow Hansen et al 
[ 2 ] ; optimal values were identified by calculating MAEtest 
for a training set size of 3000 for pairs of A and a values 
on a two-dimensional logarithmic grid, using a spacing 
factor of 2 for a and 10 for A. (This method works well 
when the model has a small number of hyper parameters, 
but needs to be replaced with more sophisticated meth- 


The performance of the representations of periodic sys¬ 
tems studied in this work are meant to be considered 
in the context of the excellent performance of the ma¬ 
chine for molecules presented in Refs. [T] and [ 2 ] To make 
this comparison clear we have reproduced the GE of 
their scheme with our implementation and the QM7 data 
set. The results are shown in Fig. [^ In Fig. [^ a two- 
dimensional PC A of the QM7 set using the Coulomb 
matrix representation is shown. Already with a train¬ 
ing set size of a couple of hundred atoms, the MAE ap¬ 
proaches the accuracy of DFT with the least computa¬ 
tionally expensive, semi-local, functionals. At a train¬ 
ing set size of 500, we arrive at a MAEtest of around 
0.07 eV/atom, whereas DFT with the PBE functional has 
an MAE for the atomization energy of small molecules 
of ca 0.15 eV/atom [29]. As demonstrated by Rupp 
et al. [Tj, the precision of the predictions of the ma¬ 
chine keeps improving with increased size of the training 
set, and at 3000 structures one finds an GE error below 
0.02 eV/atom. 

We now turn to the results of applying ML to peri¬ 
odic crystals using the feature vector representations in 
this work, shown in Fig. [^ The performance of all our 
representations are similar, but their accuracy is inferior 
to what we see for molecules. All the representations 
studied improve greatly with increased training set size. 
The MAEtest for fhe representations at a training set 
size of 3000 are 0.49 eV/atom for the Ewald sum ma¬ 
trix, 0.64 eV/atom for the extended Coulomb matrix, 
and 0.37 eV/atom for the sine matrix representation, 
ie., an order of magnitude worse than we find for the 
molecules. Furthermore, we find that MAEtrain for all 
the representations are insignificant (< 5 • 10“^) even at 
a training set size of 3000. 

Figure [^compares the two-dimensional PC A of the MP 
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FIG. 3. The mean absolute GE, Eq. (11), versus train¬ 
ing set size for the different representations considered in this 
work. Shown are MAEtest for predicting fo rma ti on e nergies of 
crystals in the MP data set: (Sine) Eqs. (23) —(Ewald) 
Eqs. (18) —(20); and (GGM) described in Sec. Ill G| Eor com¬ 
parison we also include (QM7), MAEtest for atomization ener¬ 
gies of molecu les in the QM7 data set using a regular Goulomb 
matrix, Eqs. (14) —(15). 
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EIG. 4. A two-dimensional PGA of the data sets used in this 
work: (QM7, green) the Goulomb matrix representation of the 
QM7 data set of molecules; (Sine, red) the sine matrix repre¬ 
sentation using the MP data set; (Ewald, blue) the Ewald sum 
matrix representation using the MP data set; (EGM, yellow) 
The extended Goulomb matrix data set using the MP data 
set. The PGA of the representations of the MP data set differ 
substantially from the PGA of the Goulomb matrix of QM7. 
The shading of the points represent the target value energy, 
showing that there is no clear energy-PGA pattern. 


data set to the QM7 one for the different representations. 
The QM7 PCA is localized to a set of small clusters. The 
MP data for the sine and the extended Coulomb matrix 
representations appear more uniformly spread out. The 
Ewald sum matrix PCA has a central cluster but also a 
few data points far removed from this cluster. 


VI. DISCUSSION 

The results in Fig. [^suggest that the sine matrix rep¬ 
resentation is slightly better than the other representa¬ 
tions in this work, in that it reaches a lower GE and 
has better development of the GE with training set size. 
However, the performance of the different methods are 
roughly similar, indicating that the selection between the 
feature vector representations presented in this work for 
a given application may not have major impact on the 
GE. This further strengthens the case of the sine matrix 
representation, since it is the representation that requires 
the least computational expense. It is interesting to note 
that while the PC As of the sine and extended Coulomb 
matrix bear strong resemblance, the Ewald sum PCA is 
distinctly different with a strong clustering and a few out¬ 
lier points. Furthermore, the GE of all representations 
systematically decreases with increasing training set size. 
These results suggest the GE could be reduced even fur¬ 
ther if only more extensive data set for training were 
available. 

When comparing the molecular results with the ones 


for solids, it is important to keep in mind that the di¬ 
versity and composition of the MP data set differs sub¬ 
stantially from that of QM7. In particular, only very 
few element types are present in the molecular data set 
consisting of atoms with very finite size, no molecule has 
more than seven atoms (not counting hydrogens). This is 
illustrated by the PCA in Fig.[^ where the QM7 data is 
collected in a few tight clusters. Arguably, the chemical 
space of QM7 is significantly smaller than the one we use 
to evaluate the representations for solids since the MP 
data set contains ten times more elements than QM7. 

Hence, the central conclusion of the present work is 
that there are three areas where the situation of ML 
for periodic systems can be improved: (i) our methods 
should be used with even larger data sets to confirm that 
the GE can be brought down to levels where it is use¬ 
ful for applications, (ii) further improved representations 
may be helpful if they more efficiently can represent the 
degrees of freedom offered by periodic crystals. Such rep¬ 
resentations may reduce the need for larger training sets; 
and (in ^ if a way of generating a data set over a restricted 
chemical space can be devised which is as compact as 
QM7, we may reach more promising ML performance for 
periodic systems. We are presently working in all of these 
directions. 



















VII. SUMMARY AND CONCLUSIONS 


We have investigated the performance of several crys¬ 
tal structure representations for ML models of formation 
energies of solids. Our work is a natural generalization 
of an ML scheme previously shown to be successful for 
the atomization energy of molecules. We have compared 
three different representations, and found that a sine ma¬ 
trix which simulates the features of an infinite Coulomb 
sum is both most efficient, and gives the smallest GE er¬ 
ror. While the performance of all the methods may at 
first seem disappointing when compared to the small GE 
confirmed for molecules, we can explain this discrepancy 
to be due to data sets which are too small to cover the 
hugely diverse compositional and structural space with 
sufficient density. As such, the full potential of ML for 
periodic systems still has to be demonstrated. The im¬ 
provement of the MAEtest with training set size suggests, 
however, that the methods presented here can lead to ac¬ 
curate machine models if only trained on larger or more 
restricted data sub sets. Hence, our results offer promis¬ 


ing indications that sufficiently accurate and transferable 
ML models of energies of periodic systems can be real¬ 
ized. 
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